Abstract. The notion of the cover is a generalization of a period of a
string, and there are linear time algorithms for finding the shortest cover.
The seed is a more complicated generalization of periodicity: it is a cover
of a superstring of a given string, and the shortest seed problem is of
much higher algorithmic difficulty. The problem is not well understood, and
no linear time algorithm is known. In this paper we give linear time
algorithms for some of its versions: computing the shortest left-seed array,
computing the longest left-seed array, and checking for seeds of a given
length. The algorithm for the last problem is used to compute the seed array
of a string (i.e., the shortest seeds for all the prefixes of the string) in
O(n^2) time. We also describe a simpler alternative algorithm that
efficiently computes the shortest seeds. As a by-product we obtain an
O(n log(n/m)) time algorithm checking if the shortest seed has length at
least m and finding the corresponding seed. We also correct some important
details missing in the previously known shortest-seed algorithm (Iliopoulos
et al., 1996).

1 Introduction 

The notion of periodicity in strings is widely used in many fields, such as
combinatorics on words, pattern matching, data compression and automata theory
(see [13, 14]). It is of paramount importance in several applications, not to
mention its theoretical aspects. The concept of quasiperiodicity is a
generalization of the notion of periodicity, and was defined by Apostolico and
Ehrenfeucht in [2]. In a periodic repetition the occurrences of the period do
not overlap. In contrast, the quasiperiods of a quasiperiodic string may
overlap.
** The author is supported by grant no. N206 566740 of the National Science Centre.



We consider words (strings) over a finite alphabet Σ, u ∈ Σ*; the empty
word is denoted by ε; the positions in u are numbered from 1 to |u|. By Σ^n we
denote the set of words of length n. By u^R we denote the reverse of the string
u. For u = u_1 u_2 ... u_n, let us denote by u[i..j] a factor of u equal to
u_i ... u_j (in particular u[i] = u[i..i]). Words u[1..i] are called prefixes
of u, and words u[i..n] are called suffixes of u. Words that are both prefixes
and suffixes of u are called borders of u. By border(u) we denote the length of
the longest border of u that is shorter than u. We say that a positive integer
p is the (shortest) period of a word u = u_1 ... u_n (notation: p = per(u)) if
p is the smallest positive number such that u_i = u_{i+p} for i = 1, ..., n - p.
It is a known fact [6, 8] that, for any string u, per(u) + border(u) = |u|.
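
The identity per(u) + border(u) = |u| can be illustrated with a short sketch. The code below is not taken from the paper: it uses the standard failure-function computation of the border array, and the function names and the 1-based layout with an unused index 0 are assumptions made for illustration.

```python
def border_array(u: str) -> list[int]:
    """B[i] = length of the longest border of u[1..i] shorter than u[1..i].

    Index 0 is unused padding so that indices match the 1-based text.
    """
    n = len(u)
    B = [0] * (n + 1)
    k = 0  # border length of the previous prefix
    for i in range(2, n + 1):
        # shrink the candidate border until it can be extended by u[i]
        while k > 0 and u[k] != u[i - 1]:
            k = B[k]
        if u[k] == u[i - 1]:
            k += 1
        B[i] = k
    return B


def period_array(u: str) -> list[int]:
    """P[i] = per(u[1..i]) = i - B[i]; index 0 is unused padding."""
    B = border_array(u)
    return [0] + [i - B[i] for i in range(1, len(u) + 1)]
```

For instance, for the prefix abaabaaabbaab discussed later in this section, the sketch gives border 2 and period 11.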

We say that a string s covers the string u if every letter of u is contained in
some occurrence of s as a factor of u. Then s is called a cover of u. We say that
a string s is: a seed of u if s is a factor of u and u is a factor of some string w
covered by s; a left seed of u if s is both a prefix and a seed of u; a right seed
of u if s is both a suffix and a seed of u (equivalently, s^R is a left seed of
u^R). Seeds were first defined and studied by Iliopoulos, Moore and Park [11],
who gave an O(n log n) time algorithm computing all the seeds of a given string
u ∈ Σ^n, in particular, the shortest seed of u.

By cover(u), seed(u), lseed(u) and rseed(u) we denote the length of the
shortest: cover, seed, left seed and right seed of u, respectively. By
covermax(u) and lseedmax(u) we denote the length of the longest cover and the
longest left seed of u that is shorter than u, or 0 if none.

For a string u ∈ Σ^n, we define its: period array P[1..n], border array B[1..n],
suffix period array P'[1..n], cover array C[1..n], longest cover array C^M[1..n],
seed array Seed[1..n], left-seed array LSeed[1..n], and longest left-seed array
LSeed^M[1..n] as follows:

P[i] = per(u[1..i]),              B[i] = border(u[1..i]),

P'[i] = per(u[i..n]),             C[i] = cover(u[1..i]),

C^M[i] = covermax(u[1..i]),       Seed[i] = seed(u[1..i]),

LSeed[i] = lseed(u[1..i]),        LSeed^M[i] = lseedmax(u[1..i]).
Table 1. An example string together with its periodic and quasiperiodic arrays. Note
that the left-seed array and the seed array are non-decreasing.



The border array, suffix border array and period array can be computed in O(n)
time [6,8]. Apostolico and Breslauer [1,4] gave an on-line O(n) time algorithm
computing the cover array C[1..n] of a string. Li and Smyth [12] provided an
algorithm, having the same characteristics, for computing the longest cover
array C^M[1..n] of a given string. Note that the array C^M enables computing
all covers of all prefixes of the string; the same property holds for the border
array B. Unfortunately, the LSeed^M array does not share this property.

Table 1 shows the above defined arrays for u = abaabaaabbaabaab. For
example, for the prefix u[1..13] the period equals 11, the border is ab, the
cover is abaabaaabbaab, the left seed is abaabaaabba, the longest left seed is
abaabaaabbaa, and the seed is baabaaab.

We list here several useful (though obvious) properties of covers and seeds. 

Observation 1

(a) A cover of a cover of u is also a cover of u.

(b) A cover of a left (right) seed of u is also a left (right) seed of u.

(c) A cover of a seed of u is also a seed of u.

(d) If u is a factor of v then seed(u) ≤ seed(v).

(e) If u is a prefix of v then lseed(u) ≤ lseed(v).

(f) If s and s' are two covers of a string u, |s'| < |s|, then s' is a cover of s.

(g) If s is the shortest cover or the shortest left seed or the shortest seed of a
string u then per(s) > |s|/2.

For a set X of positive integers, let us define the maxgap of X as:

maxgap(X) = max{b - a : a, b are consecutive numbers in X}, or 0 if |X| ≤ 1.

For example, maxgap({1, 3, 8, 13, 17}) = 5.

For a factor v of u, let us define Occ(v, u) as the set of starting positions of
all occurrences of v in u. By first(v) and last(v) we denote min Occ(v, u) and
max Occ(v, u) respectively. For the sake of simplicity, we will abuse the notation,
and denote maxgap(v) = maxgap(Occ(v, u)).
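
The two definitions above translate directly into code. The following is a naive sketch for illustration only (it is not one of the efficient methods discussed later); the function names are hypothetical and positions are 1-based as in the text.

```python
def occurrences(v: str, u: str) -> list[int]:
    """Occ(v, u): sorted 1-based starting positions of v in u (overlaps allowed)."""
    out = []
    start = u.find(v)
    while start != -1:
        out.append(start + 1)
        start = u.find(v, start + 1)
    return out


def maxgap(X) -> int:
    """Largest difference between consecutive elements of X, or 0 if |X| <= 1."""
    xs = sorted(X)
    return max((b - a for a, b in zip(xs, xs[1:])), default=0)
```

With these helpers, first(v) and last(v) are simply the first and last entries of `occurrences(v, u)`.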
Fig. 1. The word s = abaa is a border seed of u = aabaababaabaaba.



Assume s is a factor of u. Let us decompose the word u into w_1 w_2 w_3, where
w_2 is the longest factor of u for which s is a border, i.e.,
w_2 = u[first(s)..(last(s) + |s| - 1)]. Then we say that s is a border seed of u
if s is a seed of w_1 · s · w_3, see Fig. 1. The following fact is a corollary of
Lemma 4, proved in Section 2.



Fact 2 Let s be a factor of u ∈ Σ*. The word s is a border seed of u if and only
if |s| ≥ max(P[first(s) + |s| - 1], P'[last(s)]).



Notions of maxgaps and border seeds provide a useful characterization of seeds. 

Observation 3 Let s be a factor of u ∈ Σ*. The word s is a seed of u if and
only if |s| ≥ maxgap(s) and s is a border seed of u.

Several new and efficient algorithms related to seeds in strings are presented in
this paper. Linear time algorithms computing the left-seed array and the longest
left-seed array are given in Section 2. In Section 3 we show a linear time
algorithm finding a seed of a given length and apply it to computing the seed
array of a string in O(n^2) time. Finally, in Section 4 we describe an
alternative simple O(n log n) time computation of the shortest seed, from which
we obtain an O(n log(n/m)) time algorithm checking if the shortest seed has
length at least m (described in Section 5).

2 Computing Left-Seed Arrays

In this section we show two O(n) time algorithms for computing the left-seed
array and an O(n) time algorithm for computing the longest left-seed array of a
given string u ∈ Σ^n. We start with a simple characterization of the length of
the shortest left seed of the whole string u (see Lemma 5). In its proof we
utilize the following auxiliary lemma, which shows a correspondence between the
shortest left seed of u and shortest covers of all prefixes of u.

Lemma 4. Let s be a prefix of u, and let j be the length of the longest prefix of
u covered by s. Then s is a left seed of u if and only if j ≥ per(u).

In particular, the shortest left seed s of u is the shortest cover of the
corresponding prefix u[1..j].

Proof. (⇒) If s is a left seed of u then there exists a prefix p of s of length
at least n - j which is a suffix of u (see Fig. 2). We use here the fact that
u[1..j] is the longest prefix of u covered by s. Hence, p is a border of u, and
consequently border(u) ≥ |p| ≥ n - j. Since per(u) = n - border(u), we obtain the
desired inequality j ≥ per(u).
Fig. 2. Illustration of part (⇒) of Lemma 4.

(⇐) The inequality j ≥ per(u) implies that v = u[1..j] is a left seed of u
(see Fig. 3). Hence, by Observation 1b, the word s, which is a cover of v, is also
a left seed of u.



Finally, the "in particular" part is a consequence of Observation 1, parts b 
and f. □ 




Fig. 3. Illustration of part (⇐) of Lemma 4.



Lemma 5. Let u ∈ Σ^n and let C[1..n] be its cover array. Then:

lseed(u) = min{C[j] : j ≥ per(u)}.    (1)

Proof. By Lemma 4, the length of the shortest left seed of u can be found
among the values C[per(u)], ..., C[n]. Conversely, for each of the values C[j]
for per(u) ≤ j ≤ n, there exists a left seed of u of length C[j]. Thus lseed(u)
equals the minimum of these values, which yields the formula (1). □

Clearly, the formula (1) provides an O(n) time algorithm for computing the
shortest left seed of the whole string u. We show that, employing some
algorithmic techniques, one can use this formula to compute shortest left seeds
for all prefixes of u, i.e., computing the left-seed array of u, also in O(n) time.
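
Formula (1) can be tried out directly. The sketch below is illustrative and rests on an assumption: the cover array is computed with a well-known border-based on-line recurrence (the shortest proper cover of a prefix, if any, is the shortest cover of its longest border), which the text cites from [1,4] but does not spell out. All function names are hypothetical.

```python
def border_array(u: str) -> list[int]:
    """B[i] = longest proper border of u[1..i]; index 0 is padding."""
    n = len(u)
    B = [0] * (n + 1)
    k = 0
    for i in range(2, n + 1):
        while k > 0 and u[k] != u[i - 1]:
            k = B[k]
        if u[k] == u[i - 1]:
            k += 1
        B[i] = k
    return B


def cover_array(u: str) -> list[int]:
    """C[i] = length of the shortest cover of u[1..i] (assumed recurrence)."""
    n = len(u)
    B = border_array(u)
    C = [0] * (n + 1)
    reach = [0] * (n + 1)  # reach[c] = longest prefix seen so far with shortest cover c
    for i in range(1, n + 1):
        c = C[B[i]]  # candidate: shortest cover of the longest border
        if B[i] > 0 and reach[c] >= i - c:
            C[i] = c
            reach[c] = i
        else:
            C[i] = i
            reach[i] = i
    return C


def shortest_left_seed_length(u: str) -> int:
    """lseed(u) = min{ C[j] : per(u) <= j <= n }, i.e. formula (1)."""
    n = len(u)
    per = n - border_array(u)[n]
    C = cover_array(u)
    return min(C[j] for j in range(per, n + 1))
```

For the prefix abaabaaabbaab (period 11) the minimum is taken over C[11], C[12], C[13], which agrees with the left seed of length 11 quoted above.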

Theorem 1. For u ∈ Σ^n, its left-seed array can be computed in O(n) time.

Proof. Applying (1) to all prefixes of u, we obtain:

LSeed[i] = min{C[j] : P[i] ≤ j ≤ i}.    (2)

Recall that both the period array P[1..n] and the cover array C[1..n] of u can
be computed in O(n) time [1, 4, 6, 8].

The minimum in the formula (2) could be computed by data structures for
Range-Minimum-Queries [9, 15]; however, in this particular case we can apply a
much simpler algorithm. Note that P[i - 1] ≤ P[i], therefore the intervals of the
form [P[i], i] behave like a sliding window, i.e., both their endpoints are
non-decreasing. We use a bidirectional queue Q which stores left-minimal elements
in the current interval [P[i], i] (w.r.t. the value C[j]). In other words, elements
of Q are increasing and if Q during the step i contains an element j then
j ∈ [P[i], i] and C[j] < C[j'] for all j < j' ≤ i. We obtain an O(n) time algorithm
ComputeLeftSeedArray. □



ALGORITHM ComputeLeftSeedArray(u)

    P[1..n] := period array of u; C[1..n] := cover array of u;
    Q := emptyBidirectionalQueue;
    for i := 1 to n do
        while (not empty(Q)) and (front(Q) < P[i]) do popFront(Q);
        while (not empty(Q)) and (C[back(Q)] ≥ C[i]) do popBack(Q);
        pushBack(Q, i);
        LSeed[i] := C[front(Q)];
        { Q stores left-minimal elements of the interval [P[i], i] }
    return LSeed[1..n];
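
The sliding-window computation of formula (2) can be sketched in Python as follows. This is an illustrative transcription of the pseudocode, assuming the arrays P and C are supplied as 1-based lists with an unused entry at index 0; the function name is hypothetical.

```python
from collections import deque


def left_seed_array(P: list[int], C: list[int]) -> list[int]:
    """LSeed[i] = min{ C[j] : P[i] <= j <= i }, via a monotone deque.

    P and C are 1-based (index 0 unused); so is the returned array.
    """
    n = len(P) - 1
    LSeed = [0] * (n + 1)
    Q = deque()  # indices j with strictly increasing C[j], front is the minimum
    for i in range(1, n + 1):
        # drop indices that fell out of the window [P[i], i]
        while Q and Q[0] < P[i]:
            Q.popleft()
        # drop indices that can never be the minimum again
        while Q and C[Q[-1]] >= C[i]:
            Q.pop()
        Q.append(i)
        LSeed[i] = C[Q[0]]
    return LSeed
```

Each index is pushed and popped at most once, which is where the linear bound comes from.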



Now we proceed to an alternative algorithm computing the left-seed array, which
also utilizes the criterion from Lemma 4. We start with an auxiliary algorithm
ComputeR-Array. It computes an array R[1..n] which stores, as R[i], the length
of the longest prefix of u for which u[1..i] is the shortest cover, or 0 if none.



ALGORITHM ComputeR-Array(u)

    C[1..n] := cover array of u;
    for i := 1 to n do R[i] := 0;
    for i := 1 to n do R[C[i]] := i;
    return R[1..n];



The algorithm Alternative-ComputeLeftSeedArray computes the array LSeed
from left to right. The current value of LSeed[i] is stored in the variable ls;
note that this value never decreases (by Observation 1e). Equivalently, for each
i we have LSeed[i - 1] ≤ LSeed[i] ≤ i.

The particular value of LSeed[i] is obtained using the necessary and sufficient
condition from Lemma 4: LSeed[i] = ls if ls is the smallest number such that
|w| ≥ per(u[1..i]) = P[i], where w is the longest prefix of u[1..i] that is covered
by u[1..ls]. We slightly modify this condition, substituting w with the longest
prefix w' of the very word u that is covered by u[1..ls]. Thus we obtain the
condition R[ls] ≥ P[i] utilized in the pseudocode below.



ALGORITHM Alternative-ComputeLeftSeedArray(u)

    P[1..n] := period array of u; R[1..n] := ComputeR-Array(u);
    LSeed[0] := 0; ls := 0;
    for i := 1 to n do
        { An invariant of the loop: ls = LSeed[i - 1]. }
        while R[ls] < P[i] do ls := ls + 1;
        LSeed[i] := ls;
    return LSeed[1..n];



Theorem 2. Algorithm Alternative-ComputeLeftSeedArray runs in linear time.



Proof. Recall that the arrays P[1..n] and C[1..n] can be computed in linear
time [1, 4, 6, 8]. The array R[1..n] is obviously also computed in linear time.

It suffices to prove that the total number of steps of the while-loop in the
algorithm Alternative-ComputeLeftSeedArray is linear in terms of n. In each
step of the loop, the value of ls increases by one; this variable never decreases
and it cannot exceed n. Hence, the while-loop performs at most n steps and the
whole algorithm runs in O(n) time. □
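
A compact sketch of ComputeR-Array and Alternative-ComputeLeftSeedArray is given below. It is illustrative only: P and C are assumed to be given as 1-based lists with an unused entry at index 0, the function name is hypothetical, and setting R[0] = 0 as a sentinel for the initial value ls = 0 is an assumption not stated in the pseudocode.

```python
def left_seed_array_via_R(P: list[int], C: list[int]) -> list[int]:
    """Left-seed array computed with the monotone pointer ls and the array R."""
    n = len(P) - 1
    # R[c] = length of the longest prefix whose shortest cover is u[1..c], 0 if none.
    # R[0] = 0 acts as a sentinel (assumption) so that ls = 0 always fails the test.
    R = [0] * (n + 1)
    for i in range(1, n + 1):
        R[C[i]] = i
    LSeed = [0] * (n + 1)
    ls = 0
    for i in range(1, n + 1):
        # invariant: ls == LSeed[i - 1]; advance until the Lemma 4 test passes
        while R[ls] < P[i]:
            ls += 1
        LSeed[i] = ls
    return LSeed
```

Since ls only moves forward, the inner loop performs at most n increments over the whole run.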

Concluding this section, we describe a linear-time algorithm computing the
longest left-seed array, LSeed^M[1..n], of the string u ∈ Σ^n. The following lemma
gives a simple characterization of the length of the longest left seed of the whole
string u.

Lemma 6. Let u ∈ Σ^n. If per(u) < n then lseedmax(u) = n - 1, otherwise
lseedmax(u) = 0.

Proof. First consider the case per(u) = n. We show that lseed(u) = n,
consequently lseedmax(u) equals 0. Assume to the contrary that lseed(u) < n. Then
a non-empty prefix of the minimal left seed of u, say w, is a suffix of u (consider
the occurrence of the left seed that covers u[n]). Hence, n - |w| is a period of u,
a contradiction.

Assume now that per(u) < n. Then u is a prefix of the word u[1..per(u)] ·
u[1..n - 1], which is covered by u[1..n - 1]. Therefore u[1..n - 1] is a left seed
of u, lseedmax(u) ≥ n - 1, and consequently lseedmax(u) = n - 1. □

Using Lemma 6 we obtain LSeed^M[i] = i - 1 or LSeed^M[i] = 0 for every i,
depending on whether P[i] < i or not. We obtain the following result.

Theorem 3. The longest left-seed array of u ∈ Σ^n can be computed in O(n) time.
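
The rule from Lemma 6 amounts to a single pass over the period array. A minimal sketch, assuming P is a 1-based list with an unused entry at index 0 (the function name is hypothetical):

```python
def longest_left_seed_array(P: list[int]) -> list[int]:
    """Longest left-seed array: i - 1 if P[i] < i, else 0 (1-based, index 0 unused)."""
    return [0] + [i - 1 if P[i] < i else 0 for i in range(1, len(P))]
```

For the example string, the entry for i = 13 is 12, matching the longest left seed abaabaaabbaa quoted in Section 1.
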
3 Computing Seeds of Given Length and Seed Array 

In this section we show an O(n^2) time algorithm computing the seed array
Seed[1..n] of a given string u ∈ Σ^n. Note that a trivial approach (computing
the shortest seed for every prefix of u) yields O(n^2 log n) time complexity. In
our solution we utilize a subroutine: testing whether u has a seed of a given
length k. The following theorem shows that this test can be performed in O(n)
time.

Theorem 4. It can be checked whether a given string u ∈ Σ^n has a seed of a
given length k in O(n) time.

Proof. Assume we have already computed in O(n) time the suffix array SUF and
the LCP array of longest common prefixes, see [6]. In the algorithm we start by
dividing all factors of u of length k into groups corresponding to equal words.
Every such group can be described as a maximal interval [i..j] in the suffix
array SUF, such that each of the values LCP[i + 1], LCP[i + 2], ..., LCP[j] is at
least k. The collection of such intervals can be constructed in O(n) time by a
single traversal of the LCP and SUF arrays (lines 1-9 of Algorithm
SeedsOfAGivenLength). Moreover, using Bucket Sort, we can transform this
representation into a collection of lists, each of which describes the set
Occ(v, u) for some factor v of u, v ∈ Σ^k (lines 10-11 of the algorithm). This can
be done in linear time, provided that we use the same set of buckets in each
sorting and initialize them just once.

Now we process each of the lists separately, checking the conditions from
Observation 3: in lines 14-18 of the algorithm we check the "maxgap" condition,
and in line 19 the "border seed" condition, employing Fact 2.

Thus, having computed the arrays SUF and LCP, and the period arrays
P[1..n] and P'[1..n] of u, we can find all seeds of u of length k in O(n)
total time. □



ALGORITHM SeedsOfAGivenLength(u, k)

     1: P[1..n] := period array of u; P'[1..n] := suffix period array of u;
     2: SUF[1..n] := suffix array of u; LCP[1..n] := lcp array of u;
     3: Lists := emptyList;
     4: j := 1;
     5: while j ≤ n do
     6:     List := {SUF[j]};
     7:     while j < n and LCP[j + 1] ≥ k do
     8:         j := j + 1; List := append(List, SUF[j]);
     9:     j := j + 1; Lists := append(Lists, List);
    10: for all List in Lists do
    11:     BucketSort(List); { using the same set of buckets }
    12: for all List in Lists do
    13:     first := prev := n; last := 1; covers := true;
    14:     for all i in List do
    15:         first := min(first, i); last := max(last, i);
    16:         if i > prev + k then
    17:             covers := false;
    18:         prev := i;
    19:     if covers and (k ≥ max(P[first + k - 1], P'[last])) then
    20:         print "u[first..(first + k - 1)] is a seed of u";
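
The two tests applied to each group (the "maxgap" condition and the "border seed" condition of Fact 2) can be sketched as follows. This is not the linear-time procedure above: as a stated simplification, equal factors of length k are grouped with a dictionary instead of the suffix and LCP arrays, so the sketch runs in O(nk) expected time. The function names are hypothetical.

```python
from collections import defaultdict


def border_array(w: str) -> list[int]:
    """B[i] = longest proper border of w[1..i]; index 0 is padding."""
    B = [0] * (len(w) + 1)
    k = 0
    for i in range(2, len(w) + 1):
        while k > 0 and w[k] != w[i - 1]:
            k = B[k]
        if w[k] == w[i - 1]:
            k += 1
        B[i] = k
    return B


def seeds_of_length(u: str, k: int) -> list[str]:
    """All seeds of u of length k >= 1, checked via Observation 3 and Fact 2."""
    n = len(u)
    B = border_array(u)
    P = [0] + [i - B[i] for i in range(1, n + 1)]  # P[i] = per(u[1..i])
    Brev = border_array(u[::-1])
    # P'[i] = per(u[i..n]); a word and its reverse have the same period
    Ps = [0] * (n + 2)
    for i in range(1, n + 1):
        m = n - i + 1
        Ps[i] = m - Brev[m]
    groups = defaultdict(list)  # factor -> increasing list of 1-based positions
    for i in range(1, n - k + 2):
        groups[u[i - 1:i - 1 + k]].append(i)
    seeds = []
    for s, occ in groups.items():
        covers = all(b - a <= k for a, b in zip(occ, occ[1:]))  # maxgap(s) <= k
        first, last = occ[0], occ[-1]
        if covers and k >= max(P[first + k - 1], Ps[last]):     # Fact 2
            seeds.append(s)
    return sorted(seeds)
```

For example, `seeds_of_length("abab", 2)` reports both ab and ba, while no single letter is a seed of abab.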



We compute the elements of the seed array Seed[1..n] from left to right, i.e., in
the order of increasing lengths of prefixes of u. Note that Seed[i + 1] ≥ Seed[i]
for any 1 ≤ i ≤ n - 1; this is due to Observation 1d. If Seed[i + 1] > Seed[i]
then we increase the current length of the seed by one letter at a time; in total
at most n - 1 such operations are performed. Each time we query for the existence
of a seed of a given length using the algorithm from Theorem 4. Thus we obtain
O(n^2) time complexity.

Theorem 5. The seed array of a string u ∈ Σ^n can be computed in O(n^2) time.
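
The left-to-right scheme can be sketched as below. To keep the example short, the seed-of-given-length test is a brute-force check written from the definition of a seed (full occurrences plus one overhanging occurrence at each end), not the linear-time subroutine of Theorem 4, so this sketch does not achieve the O(n^2) bound. All function names are hypothetical.

```python
def is_seed(s: str, w: str) -> bool:
    """Brute force: does s (a factor of w) cover some superstring of w?"""
    n, k = len(w), len(s)
    covered = [False] * n
    start = w.find(s)
    while start != -1:  # full occurrences of s inside w
        for p in range(start, start + k):
            covered[p] = True
        start = w.find(s, start + 1)
    for t in range(1, k):  # overhanging occurrences sharing t < k letters with w
        if w[:t] == s[k - t:]:   # a suffix of s is a prefix of w
            for p in range(t):
                covered[p] = True
        if w[n - t:] == s[:t]:   # a prefix of s is a suffix of w
            for p in range(n - t, n):
                covered[p] = True
    return all(covered)


def has_seed_of_length(w: str, k: int) -> bool:
    return any(is_seed(w[i:i + k], w) for i in range(len(w) - k + 1))


def seed_array(u: str) -> list[int]:
    """Seed[i] = length of the shortest seed of u[1..i] (index 0 unused)."""
    Seed = [0] * (len(u) + 1)
    k = 1
    for i in range(1, len(u) + 1):
        # Seed is non-decreasing, so k never has to be reset
        while not has_seed_of_length(u[:i], k):
            k += 1
        Seed[i] = k
    return Seed
```

The candidate length k is shared across all prefixes, which mirrors the amortization argument above: it is incremented fewer than n times in total.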



4 Alternative Algorithm for Shortest Seeds 



In this section we present a new approach to shortest seeds computation based
on very simple independent processing of disjoint chains in the suffix tree. It
simplifies the computation of shortest seeds considerably.

Our algorithm is also based on a slightly modified version of Observation 3,
formulated below as Lemma 7, which allows us to relax the definition of maxgaps.
We discuss an algorithmically easier version of maxgaps, called prefix maxgaps,
and show that it can substitute maxgap values when looking for the shortest
seed.

We start by analyzing the "border seed" condition. We introduce a somewhat
more abstract representation of sets of factors of u, called prefix families, and
show how to find in them the shortest border seeds of u. Afterwards the key
algorithm for computing prefix maxgaps is presented. Finally, both techniques
are utilized to compute the shortest seed.

Let us fix the input string u ∈ Σ^n. For v ∈ Σ*, by PREF(v) we denote the
set of all prefixes of v and by PREF(v, k) we denote PREF(v) ∩ Σ^k Σ* (limited
prefix subset).

Let F be a family of limited prefix subsets of some factors of u; we call F
a prefix family. Every element PREF(v, k) ∈ F can be represented in a canonical
form, by a tuple of integers: (first(v), last(v), k, |v|). Such a representation
requires only constant space per element. By bseed(u, F) we denote the shortest
border seed of u contained in some element of F.

Example 1. Let u = aabaababaabaaba be the example word from Fig. 1. Let:

F = {PREF(abaab, 4), PREF(babaa, 4)} = {(2, 10, 4, 5), (6, 6, 4, 5)}.

Note that ∪F = {abaa, abaab, baba, babaa}. Then bseed(u, F) = abaa.

The proof of the following fact is present implicitly in [11] (type-A and type-B
seeds).

Theorem 6. Let u ∈ Σ^n and let F be a prefix family given in a canonical form.
Then bseed(u, F) can be computed in linear time.

Alternative proof of Theorem 6. There is an alternative algorithm for
computing bseed(u, F), based on a special version of the Find-Union data
structure. Recall that B[1..n] is the border array of u. Denote by FirstGE(I, c)
(first-greater-equal) a query:

FirstGE(I, c) = min{i : i ∈ I, B[i] ≥ c},

where I is a subinterval of [1..n]. We assume that min ∅ = +∞. A sequence of a
linear number of such queries, sorted according to non-decreasing values of c, can
be easily answered in linear time, using an interval version of the Find-Union
data structure, see [7, 10]. The following algorithm applies the condition for
border seed from Fact 2 to every element of F, with P[first(s) + |s| - 1]
substituted by first(s) + |s| - 1 - B[first(s) + |s| - 1]. We omit the details. □



ALGORITHM ComputeBorderSeed(u, F)

    bseed := +∞;
    for all (first(v), last(v), k, |v|) in F, in non-decreasing order of first(v) do
        k := max(P'[last(v)], k);
        I := [first(v) + k - 1, first(v) + |v| - 1];
        pos := FirstGE(I, first(v) - 1);
        bseed := min(bseed, pos - first(v) + 1);
    return bseed;



Computation of the shortest seeds via prefix maxgaps. Let T(u) be
the suffix tree of u; recall that it can be constructed in O(n) time [6, 8]. By
Nodes(u) we denote the set of factors of u corresponding to explicit nodes of
T(u); for simplicity we identify the nodes with the strings they represent. For
v ∈ Nodes(u), the set Occ(v, u) corresponds to the leaf list of the node v (i.e.,
the set of values of leaves in the subtree rooted at v), denoted as LL(v). Note
that first(v) = min LL(v) and last(v) = max LL(v), and such values can be computed
for all v ∈ Nodes(u) in O(n) time. For v ∈ Nodes(u), we define the prefix maxgap
of v as:

Δ(v) = max{maxgap(w) : w ∈ PREF(v)}.

Equivalently, Δ(v) is the maximum of maxgap values on the path from v to the
root of T(u). We introduce an auxiliary problem:

Prefix Maxgap Problem:

given a word u ∈ Σ^n, compute Δ(v) for all v ∈ Nodes(u).

The following lemma (an alternative formulation of Observation 3) shows that
prefix maxgaps can be used instead of maxgaps in searching for seeds. This is
important since computation of prefix maxgaps Δ(v) is simple in comparison
with maxgap(v); this is due to the fact that the Δ(v) values on each path
down the suffix tree T(u) are non-decreasing. Efficient computation of maxgap(v)
requires using augmented height-balanced trees [5] or other rather sophisticated
techniques [3]. The shortest-seed algorithm in [11] also computes prefix maxgaps
instead of maxgaps; however, this observation is missing in [11].

Lemma 7. Let s be a factor of u ∈ Σ* and let w be the shortest element of
Nodes(u) such that s ∈ PREF(w). The word s is a seed of u if and only if
|s| ≥ Δ(w) and s is a border seed of u.

Proof. If s corresponds to an element of Nodes(u), then s = w. Otherwise,
s corresponds to an implicit node in an edge in the suffix tree, and w is the
lower end of the edge. Note that in both cases we have Δ(w) ≥ maxgap(w) =
maxgap(s). By Observation 3, this implies part (⇐) of the conclusion. As for the
part (⇒), it suffices to show that |s| ≥ Δ(w).

Assume, to the contrary, that |s| < Δ(w). Let v ∈ PREF(w) ∩ Nodes(u) be
the word for which maxgap(v) = Δ(w) and let a, b be consecutive elements of
the set Occ(v, u) for which a + maxgap(v) = b.

Let us note that no occurrence of s starts at any of the positions a + 1, ...,
b - 1. Moreover, none of the suffixes of the form u[i..n], for a + 1 ≤ i ≤ b - 1,
is a prefix of s. Indeed, v is a prefix of s of length at most n - b + 1, and such
an occurrence of s (or its prefix) would imply an extra occurrence of v. Note that
at most |s| ≤ b - a - 1 first positions in the interval [a, b] can be covered by an
occurrence of s in u (at position a or earlier) or by a suffix of s which is a
prefix of u. Hence, position b - 1 is not covered by s at all, a contradiction. □

By Lemma 7, to complete the shortest seed algorithm it suffices to solve the
Prefix Maxgap Problem (this is further clarified in the ComputeShortestSeed
algorithm below). For this, we consider the following problem. By SORT(X) we
denote the sorted sequence of elements of X ⊆ {1, 2, ..., n}.

Chain Prefix Maxgap Problem

Input: a family of disjoint sets X_1, X_2, ..., X_k ⊆ {1, 2, ..., n}
together with SORT(X_1 ∪ X_2 ∪ ... ∪ X_k).

The size of the input is m = Σ |X_i|.
Output: the numbers Δ_i = max_{j ≤ i} maxgap(X_j ∪ X_{j+1} ∪ ... ∪ X_k).

Theorem 7. The Chain Prefix Maxgap Problem can be solved in O(m) time
using an auxiliary array of size n.

Proof. Initially we have the list L = SORT(X_1 ∪ X_2 ∪ ... ∪ X_k). Let pred and
suc denote the predecessor and successor of an element of L. The elements of
L store a Boolean flag marked, initially set to false. In the algorithm we use an
auxiliary array pos[1..n] such that pos[i] is a pointer to the element of value i
in L; if there is no such element then the value of pos[i] can be arbitrary.
Obviously the algorithm takes O(m) time. □



ALGORITHM ChainPrefixMaxgap(L)

    Δ_1 := maxgap(L); { naive computation }
    for j := 2 to k do
        Δ_j := Δ_{j-1};
        for all i in X_{j-1} do marked(pos[i]) := true;
        for all i in X_{j-1} do
            p := pred(pos[i]); q := suc(pos[i]);
            if (p ≠ nil) and (q ≠ nil) and (not marked(p))
                    and (not marked(q)) then
                Δ_j := max(Δ_j, value(q) - value(p));
            delete(L, pos[i]);
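
The deletion-based scheme can be sketched with an array-backed doubly linked list indexed by value, which plays the role of both L and pos. The sketch is illustrative and makes one simplifying assumption: instead of receiving the sorted union as input, it builds the list by scanning 1..n, so it runs in O(n + m) rather than O(m) time. The function name is hypothetical.

```python
def chain_prefix_maxgap(sets: list[list[int]], n: int) -> list[int]:
    """Return [D_1, ..., D_k] where D_i = max over j <= i of
    maxgap(X_j u X_{j+1} u ... u X_k), for disjoint X_1..X_k within 1..n."""
    present = [False] * (n + 2)
    for X in sets:
        for x in X:
            present[x] = True
    # linked list over the sorted union; 0 stands for nil
    pred = [0] * (n + 2)
    succ = [0] * (n + 2)
    last, gap = 0, 0
    for x in range(1, n + 1):
        if present[x]:
            pred[x] = last
            if last:
                succ[last] = x
                gap = max(gap, x - last)  # naive maxgap of the full union
            last = x
    marked = [False] * (n + 2)
    deltas = [gap]
    for j in range(1, len(sets)):
        cur = deltas[-1]
        for x in sets[j - 1]:
            marked[x] = True
        for x in sets[j - 1]:
            p, q = pred[x], succ[x]
            # a new gap is final only when both neighbours survive this round
            if p and q and not marked[p] and not marked[q]:
                cur = max(cur, q - p)
            if p:
                succ[p] = q
            if q:
                pred[q] = p
        deltas.append(cur)
    return deltas
```

For example, with X_1 = {3, 8}, X_2 = {1, 13}, X_3 = {17}, the union has maxgap 5, and removing X_1 opens the gap 13 - 1 = 12, so the output is [5, 12, 12].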



Theorem 8. The Prefix Maxgap Problem can be reduced to a collection of Chain
Prefix Maxgap Problems of total size O(n log n).



Proof. We solve a more abstract version of the Prefix Maxgap Problem. We are
given an arbitrary tree T with n leaves annotated with distinct integers from
the interval [1, n], and we need to compute the values Δ(v) for all v ∈ Nodes(T),
defined as follows: maxgap(v) = maxgap(LL(v)), where LL(v) is the leaf list of
v, and Δ(v) is the maximum of the values maxgap on the path from v to the
root of T. We start by sorting LL(root(T)), which can be done in O(n) time.
Throughout the algorithm we store a global auxiliary array pos[1..n], required
in the ChainPrefixMaxgap algorithm.

Let us find a heaviest path P in T, i.e., a path from the root down to a leaf,
such that all hanging subtrees are of size at most |T|/2 each. The values of Δ(v)
for v ∈ P can all be computed in O(n) time, using a reduction to the Chain
Prefix Maxgap Problem (see Fig. 4).



Fig. 4. A tree with an example heaviest path P (in bold). The values Δ(v) for v ∈ P
can be computed using a reduction to the Chain Prefix Maxgap Problem with the sets
X_1 through X_4.

Then we perform the computation recursively for the hanging subtrees,
previously sorting LL(T') for each hanging subtree T'. Such sorting operations
can be performed in O(n) total time for all hanging subtrees.

At each level of recursion we need a linear amount of time, and the depth of
recursion is logarithmic. Hence, the total size of invoked Chain Prefix Maxgap
Problems is O(n log n). □

Now we proceed to the shortest seed computation. In the algorithm we consider
all factors of u, dividing them into groups corresponding to elements of Nodes(u).
Let w ∈ Nodes(u) and let v be its parent. Let s ∈ PREF(w) be a word containing
v as a proper prefix, i.e., s ∈ PREF(w, |v| + 1). By Lemma 7, the word s is a
seed of u if and only if |s| ≥ Δ(w) and s is a border seed of u.



Using the previously described reductions (Theorems 6-8), we obtain the
following algorithm:



ALGORITHM ComputeShortestSeed(u)

    Construct the suffix tree T(u) for the input string u;
    Solve the Prefix Maxgap Problem for T(u) using the ChainPrefixMaxgap
        algorithm, in O(n log n) total time (Theorems 7 and 8);
    F := { PREF(w, max(|v| + 1, Δ(w))) : (v, w) is an edge in T(u) };
    return bseed(u, F); { Theorem 6 }



Observe that the workhorse of the algorithm is the chain version of the Prefix
Maxgap Problem, which has a fairly simple linear time solution. The main
problem is of a structural nature: we have a collection of very simple problems,
each computable in linear time, but their total size is not linear. This identifies
the bottleneck of the algorithm from the complexity point of view.



5 Long Seeds 

Note that the most time-expensive part of the ComputeShortestSeed algorithm is
the computation of prefix maxgaps; all the remaining operations are performed
in O(n) time. Using this observation we can show a more efficient algorithm
computing the shortest seed provided that its length m is sufficiently large. For
example, if m = Θ(n) then we obtain an O(n) time algorithm for the shortest
seed.

Theorem 9. One can check if the shortest seed of a given string u has length at
least m in O(n log(n/m)) time, where n = |u|. If so, a corresponding seed can
be reported within the same time complexity.

Proof. We show how to modify the ComputeShortestSeed algorithm. Denote by
s the shortest seed of u, |s| = m.

By Observation 1g, the longest overlap between consecutive occurrences of s
in u is at most m/2, therefore the number of occurrences of s in u is at most
2n/m. Hence, searching for the shortest seed of length at least m, it suffices to
consider nodes v of the suffix tree T(u) for which: |v| ≥ m and |LL(v)| ≤ 2n/m.

Thus, we are only interested in prefix maxgaps for nodes in several subtrees
of T(u), each of which contains O(n/m) nodes. Thanks to the small size of each
subtree, the algorithm ComputeShortestSeed finds all such prefix maxgaps in
O(n log(n/m)) time. Please note that using this algorithm for each node we
obtain a prefix maxgap only in its subtree (not necessarily in the whole tree);
however, Lemma 7 can be simply adjusted to such a modified definition of prefix
maxgaps. □